Q1: Single-cell sequencing

Single-cell sequencing uses tiny nanoliter droplets that are created using microfluidics. Ideally, each droplet contains one individual cell and one bead that has a molecular barcode, used to uniquely label transcripts from that cell.

a. Rate of cells per droplet

If the volume of each droplet is 0.5 nanoliter, and cells are present at a constant concentration of 100 cells per microliter, what is the rate of cells included per droplet? (Hint: Remember to convert units!!!)

# Calculate the rate of the event (a cell being present in a droplet)
# Convert ul to nl: conc = (100/ul)/(1000/nl) = 0.1 cell / nl
cell.droplet.rate <- 0.1 * 0.5
print(paste("Rate is", cell.droplet.rate, "cells per droplet"))
## [1] "Rate is 0.05 cells per droplet"

What is the expected percentage of droplets that don’t have even a single cell?

print(paste0((1 - cell.droplet.rate)*100, "% of droplets do not contain a cell."))
## [1] "95% of droplets do not contain a cell."

Typically, an experiment might start out with around one million cells. Is the Poisson distribution a good model for these data?

# your answer here
The Poisson distribution measures the number of events that occur in a fixed interval when events occur at a constant rate. The event here is a cell being included in a droplet, concentration is the known rate of events occurring, and the droplet size is a fixed interval. 

Since p is low and n is high, and there may be more than one event per interval, the Poisson is a good model for this experimental setup. 

b. Probability mass function

Plot the PMF for the Poisson distribution using this rate (the PMF is a PDF for a discrete distribution):

  • Use one of R’s pois family of functions to generate an ideal Poisson PMF with the appropriate lambda parameter across a range of 0-10 cells.
  • Check that the total sum of the densities is equal to 1.
  • Turn this into a data frame with two columns: Cell_count and Density.
  • Use ggplot to plot a histogram of the probability distribution of cell counts per droplet.
    • Use stat="identity" in the geom layer
    • Use scale_x_continuous(breaks=pretty_breaks()) (pretty breaks is from the scales package)
# ideal probability mass function
range = c(0:10)
cell.droplet.pmf <- dpois(range, cell.droplet.rate)

# Check the sum of the PMF
cell.droplet.pmf.sum <- sum(cell.droplet.pmf)
cell.droplet.pmf.sum
## [1] 1
# Create the data frame
cell.droplet.pmf.df <- data.frame(Cell_count = range,
                                  Density    = cell.droplet.pmf)
cell.droplet.pmf.df
##    Cell_count      Density
## 1           0 9.512294e-01
## 2           1 4.756147e-02
## 3           2 1.189037e-03
## 4           3 1.981728e-05
## 5           4 2.477160e-07
## 6           5 2.477160e-09
## 7           6 2.064300e-11
## 8           7 1.474500e-13
## 9           8 9.215625e-16
## 10          9 5.119792e-18
## 11         10 2.559896e-20
# Plot the distribution
ggplot(cell.droplet.pmf.df, aes(x=Cell_count, y=Density)) +
  geom_histogram(stat="identity", fill="deepskyblue", color="black", alpha=0.25) +
  theme_classic() +
  # Use the `pretty_breaks()` function from the scales package to make the X-axis cleaner
  scale_x_continuous(breaks=pretty_breaks()) +
  xlab("Cells per droplet")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

c. PMF for different concentrations

The experimentalist can control the concentration of cells in this experiment. Plot the distribution of cell counts for experiments with 100, 1000, or 2000 cells/microliter.

  • Generate PDF data for each concentration of cells/ul

  • Make a data frame containing the distributions

  • Convert the data frame to long format before plotting

    • there are several functions to do this; try using melt from the reshape package:
      • ‘id.vars’ : independent variable (for x-axis)
      • ‘value.name’ : dependent variable (for y-axis)
      • ‘variable.name’ : category to which each datapoint belongs
  • Use ggplot2 to make a histogram showing all three distributions:

    • use stat="identity" in the histogram layer
    • use position="dodge2" to position the bars next to each other instead of superimposed
    • use some transparency if you don’t like super bright colors
    • add any other bells and whistles you want
# rate per droplet
mu = cell.droplet.rate
mu
## [1] 0.05
# generate the PDFs
cell.droplet.pmf.n100  = dpois(range, mu)       # 100
cell.droplet.pmf.n1000 = dpois(range, mu * 10)  # 1k
cell.droplet.pmf.n2000 = dpois(range, mu * 20)  # 2k

# Create a data frame with one column for the range vector and one for each dataset
q1c.data.frame <- data.frame(Cell_count = range,
                             n100  = cell.droplet.pmf.n100,
                             n1000 = cell.droplet.pmf.n1000,
                             n2000 = cell.droplet.pmf.n2000)

# look at the first few lines of the data.frame
head(q1c.data.frame)
##   Cell_count         n100        n1000       n2000
## 1          0 9.512294e-01 0.6065306597 0.367879441
## 2          1 4.756147e-02 0.3032653299 0.367879441
## 3          2 1.189037e-03 0.0758163325 0.183939721
## 4          3 1.981728e-05 0.0126360554 0.061313240
## 5          4 2.477160e-07 0.0015795069 0.015328310
## 6          5 2.477160e-09 0.0001579507 0.003065662
# Melt the dataframe for ggplot to convert the data to "long" format
# This allows us to group the data by Concentration for plotting
# Specify the column names:
q1c.data.frame.melted <- melt(q1c.data.frame,
                              id.vars = "Cell_count",            # x-axis
                              value.name = "Density",            # y-axis
                              variable.name = "Concentration")   # categories (plot key)

# look at the first few lines of the melted data frame
head(q1c.data.frame.melted)
##   Cell_count Concentration      Density
## 1          0          n100 9.512294e-01
## 2          1          n100 4.756147e-02
## 3          2          n100 1.189037e-03
## 4          3          n100 1.981728e-05
## 5          4          n100 2.477160e-07
## 6          5          n100 2.477160e-09
# Make the plot
ggplot(q1c.data.frame.melted, aes(x=Cell_count, y=Density, fill=Concentration)) +
  geom_histogram(stat="identity", position="dodge2", alpha=0.5) +
  scale_fill_manual(values = c("goldenrod","red","blue")) +
  theme_classic() +
  scale_x_continuous(breaks=pretty_breaks()) +
  xlab("Cells per droplet")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

d. Droplets per experiment

Assuming experiment produces one million (1e6) droplets, convert the density plot above to a histogram showing the number of droplets per experiment that contain 0-10 cells per droplet for each concentration. Add a label to the y-axis indicating that it now represents the number of droplets instead of the density.

  • Hint: You can change the scale of the y-axis directly in ggplot without changing your data frame!
# Make the plot
ggplot(q1c.data.frame.melted, aes(x=Cell_count, y=Density*1e6, fill=Concentration)) +
  geom_histogram(stat="identity", position="dodge2", alpha=0.5) +
  scale_fill_manual(values = c("goldenrod","red","blue")) +
  theme_classic() +
  ylab("Number of droplets") +
  scale_x_continuous(breaks=pretty_breaks())
## Warning: Ignoring unknown parameters: binwidth, bins, pad

e. Expected droplet counts per experiment

To make the next few steps easier, now go ahead and add a column to your melted data frame containing the number of droplets out of 1e6 corresponding to the density for each datapoint. Take the sum of the droplet counts for each concentration to make sure this column adds up correctly.

q1c.data.frame.melted = q1c.data.frame.melted %>%  mutate(Expected_droplets = Density*1e6)
head(q1c.data.frame.melted)
##   Cell_count Concentration      Density Expected_droplets
## 1          0          n100 9.512294e-01      9.512294e+05
## 2          1          n100 4.756147e-02      4.756147e+04
## 3          2          n100 1.189037e-03      1.189037e+03
## 4          3          n100 1.981728e-05      1.981728e+01
## 5          4          n100 2.477160e-07      2.477160e-01
## 6          5          n100 2.477160e-09      2.477160e-03
q1c.data.frame.melted %>% filter(Concentration == "n100")  %>%  select(Expected_droplets) %>%  sum
## [1] 1e+06
q1c.data.frame.melted %>% filter(Concentration == "n1000") %>%  select(Expected_droplets) %>%  sum
## [1] 1e+06
q1c.data.frame.melted %>% filter(Concentration == "n2000") %>%  select(Expected_droplets) %>%  sum
## [1] 1e+06

f. How many droplets are expected to contain exactly 1 cell?

Select just the rows with 1 cell, and then plot the number of droplets that are expected to contain only one cell for each concentration. Instead of using geom_histogram(), try using geom_col().

# filter the data
q1f.melted.1cell <- filter(q1c.data.frame.melted, Cell_count == 1)
q1f.melted.1cell
##   Cell_count Concentration    Density Expected_droplets
## 1          1          n100 0.04756147          47561.47
## 2          1         n1000 0.30326533         303265.33
## 3          1         n2000 0.36787944         367879.44
# Plot the data as a bar plot
ggplot(q1f.melted.1cell, 
       aes(x=Concentration, y=Expected_droplets, fill=Expected_droplets)) +
  geom_col() +
  theme_classic()

Which concentration produces the largest number of droplets with one cell?

# your answer here
200 cells / microliter gives the largest number of droplets with exactly one cell.

g. Minimizing doublets

The optimal cell concentration maximizes the number of droplets with a single cell while at theh same time minimizing the rate of ‘doublets’, which have 2 or more cells in a single droplet.

Plot the number of droplets that have 1 cell and the number that have 2+ cells for each experimental concentration. To do this, we suggest the following steps (but you can do this any way you want):

  • Extract the rows containing more than one cell from the melted data frame, and create a new data frame containing just those rows with 2+ cells.
  • Convert this to a data frame containing just the sum totals for each concentration.
    • Try using the aggregate function to do this, using the formula . ~ Concentration.
  • Create a new data frame with the combined 1-cell and 2+ cell totals for each concentration.
    • You’ll notice that the “Cell_count” column contains the sum of the cell counts, so change all the values in this column to “2+” instead.
  • Plot the results using ggplot as per the last question above.

Whatever way you choose to do this, make sure to comment your code!

# filter the data
q1g.melted.2_plus_cells = filter(q1c.data.frame.melted, Cell_count > 1)
q1g.melted.2_plus_cells
##    Cell_count Concentration      Density Expected_droplets
## 1           2          n100 1.189037e-03      1.189037e+03
## 2           3          n100 1.981728e-05      1.981728e+01
## 3           4          n100 2.477160e-07      2.477160e-01
## 4           5          n100 2.477160e-09      2.477160e-03
## 5           6          n100 2.064300e-11      2.064300e-05
## 6           7          n100 1.474500e-13      1.474500e-07
## 7           8          n100 9.215625e-16      9.215625e-10
## 8           9          n100 5.119792e-18      5.119792e-12
## 9          10          n100 2.559896e-20      2.559896e-14
## 10          2         n1000 7.581633e-02      7.581633e+04
## 11          3         n1000 1.263606e-02      1.263606e+04
## 12          4         n1000 1.579507e-03      1.579507e+03
## 13          5         n1000 1.579507e-04      1.579507e+02
## 14          6         n1000 1.316256e-05      1.316256e+01
## 15          7         n1000 9.401827e-07      9.401827e-01
## 16          8         n1000 5.876142e-08      5.876142e-02
## 17          9         n1000 3.264523e-09      3.264523e-03
## 18         10         n1000 1.632262e-10      1.632262e-04
## 19          2         n2000 1.839397e-01      1.839397e+05
## 20          3         n2000 6.131324e-02      6.131324e+04
## 21          4         n2000 1.532831e-02      1.532831e+04
## 22          5         n2000 3.065662e-03      3.065662e+03
## 23          6         n2000 5.109437e-04      5.109437e+02
## 24          7         n2000 7.299195e-05      7.299195e+01
## 25          8         n2000 9.123994e-06      9.123994e+00
## 26          9         n2000 1.013777e-06      1.013777e+00
## 27         10         n2000 1.013777e-07      1.013777e-01
# sum up the totals by group using the aggregate function
q1g.melted.2_plus_cells = aggregate(. ~ Concentration, q1g.melted.2_plus_cells, sum)
q1g.melted.2_plus_cells
##   Concentration Cell_count     Density Expected_droplets
## 1          n100         54 0.001209104          1209.104
## 2         n1000         54 0.090204010         90204.010
## 3         n2000         54 0.264241108        264241.108
# relabel the cell counts as "2+"
q1g.melted.2_plus_cells$Cell_count = "2+"
q1g.melted.2_plus_cells
##   Concentration Cell_count     Density Expected_droplets
## 1          n100         2+ 0.001209104          1209.104
## 2         n1000         2+ 0.090204010         90204.010
## 3         n2000         2+ 0.264241108        264241.108
# combine the 1 and 2+ data
q1g.plot.data = rbind(q1f.melted.1cell, q1g.melted.2_plus_cells)
q1g.plot.data
##   Cell_count Concentration     Density Expected_droplets
## 1          1          n100 0.047561471         47561.471
## 2          1         n1000 0.303265330        303265.330
## 3          1         n2000 0.367879441        367879.441
## 4         2+          n100 0.001209104          1209.104
## 5         2+         n1000 0.090204010         90204.010
## 6         2+         n2000 0.264241108        264241.108
# Plot the combined data as a bar plot
ggplot(q1g.plot.data, aes(x=Concentration, y=Expected_droplets, fill=Cell_count)) +
  geom_col(position="dodge2") + # plot side-by-side
  theme_classic()

Which experimental concentration has the lowest doublet rate? Which concentration would you choose for your experiment?

# your answer here
The 100 cells / microliter concentration has the lowest doublet rate. Even though you get a lot more droplets with one cell/ul at 1000 or 2000 cells, the rate of doublets is also higher.

Since the goal is to minimize doublets, we will take a hit in total droplets with 1 cell. This tradeoff is worth it because otherwise we will not be able to deconvolute the data.

h. Chi-square goodness-of-fit test

Above, we simulated idealized data for this experiment for the purposes of illustration. If we could optically measure the number of cells per droplet in an actual experiment, then we could see whether the data actually fit the expected distribution.

Let’s say that you’ve used a cell counter at the start of the experiment to estimate the concentration of cells. You targeted to get a concentration of 100 cells per microliter, but the actual concentration was 110 cells per microliter.

Now you want to see how well the actual distribution of cells per droplet will match the expected distribution.

  • First, take 1e6 samples from a Poisson with a rate constant based on the actual concentration of cells per microliter (instead of the ideal 100 cells/ul).
  • Make a table of the observed number of cells per droplet.
  • Use xtabs() to make a table of the expected number of cells per droplet for the ideal n100 distribution.
    • You can use the formula Expected_droplets ~ Cell_count here.
# sample 1M droplets from the Poisson distribution
sample.rpois = rpois(1e6, lambda=0.11 * 0.5)
head(sample.rpois)
## [1] 0 0 0 0 0 1
# make a frequency table from the sampled data
sample.rpois.table = table(sample.rpois)
sample.rpois.table
## sample.rpois
##      0      1      2      3 
## 946807  51670   1496     27
# filter the n100 data from the ideal distributions
n100.data = q1c.data.frame.melted %>% filter(Concentration == "n100") %>% select(c(Cell_count, Expected_droplets))

# use xtabs to make a table of the ideal n100 data
ideal.rpois.table = xtabs(Expected_droplets ~ Cell_count, data = n100.data)
ideal.rpois.table
## Cell_count
##            0            1            2            3            4            5 
## 9.512294e+05 4.756147e+04 1.189037e+03 1.981728e+01 2.477160e-01 2.477160e-03 
##            6            7            8            9           10 
## 2.064300e-05 1.474500e-07 9.215625e-10 5.119792e-12 2.559896e-14

Since the sampled data will have only up to 3 or 4 cells per droplet, you will notice that the two tables are not the same size! To generate the table we need for the Chi-square test, you will need to truncate the “ideal” table to match the length of the sampled data.

Go ahead and this, and then join them together into a single contingency table. You should end up with a table containing two rows and around 3-4 columns.

# truncate the table with the ideal data to the correct length
ideal.rpois.table = ideal.rpois.table[1:length(sample.rpois.table)]
ideal.rpois.table
## Cell_count
##            0            1            2            3 
## 951229.42450  47561.47123   1189.03678     19.81728
# bind the data together into a single table
chisq.data = rbind(sample.rpois.table, 
                   ideal.rpois.table)
chisq.data
##                           0        1        2        3
## sample.rpois.table 946807.0 51670.00 1496.000 27.00000
## ideal.rpois.table  951229.4 47561.47 1189.037 19.81728

Plot the distribution of cell counts per droplet for the observed and expected data. How different do they look?

# make a bar plot to compare these visually
barplot(chisq.data, beside=T, 
        col=c("aquamarine3","coral"))
legend("topright", c("Observed","Expected Poisson"), pch=15, 
       col=c("aquamarine3","coral"), 
       bty="n")

Finally, perform a Chi-squared test of the observed vs. expected cell counts.

# do the chi-squared test
chisq.test( chisq.data )
## 
##  Pearson's Chi-squared test
## 
## data:  chisq.data
## X-squared = 216.61, df = 3, p-value < 2.2e-16
chisq.test( chisq.data )$p.value
## [1] 1.086926e-46

How do your sampled data compare to the ideal data in the Chi-squared test?

# your answer here
Even though the distributions look pretty close by eye, for most samples they will be highly statistically different. In fact sometimes the sampled data will contain even fewer samples with o or 2+ cells (even though the lambda is a little bit higher), just due to differences between random samples.

However, it looks like there should still be very few droplets with more than one cell, so you can probably go ahead with the experiment anyway (or you could just dilute your sample a little bit instead.)

This is a case where even though the statistics may be significant, the overall magnitude of the differences is relatively small, so in practice they won't matter that much.

Over the last few years continuous improvements in single-cell technology have been made to increase the proportion of droplets containing a single cell as well as throughput. Alternative methods such as split-pooling have also been developed that can be scaled to accommodate very large samples, and a variety of methods have been developed to allow multiplexing of multiple samples.

Q2: Mice high-fat diet study

In a previous homework, we looked at the mouse high-fat diet dataset and computed confidence intervals for multiple samples of 12 mice from the control population. Since we didn’t know about the \(t\)-distribution, we just used a \(z\)-distribution to calculate confidence intervals.

You may or may not have noticed that when you re-ran the code for finding the 95% CI for samples from the control population, that usually slightly more than 5 samples out of 100 did not contain the true population mean. This is because 95%CIs for the \(z\)-distribution were too narrow and did not take into account the variability of small samples!

The dataset attached to this homework contains the cleaned dataset we made in that homework. First, load the dataset and create a vector containing just the weights for the male control population.

# load the dataset
mice.pheno = read.csv("mice.pheno.cleaned.csv")

# subset the bodyweights of the **male** mice fed the **chow** diet
chow = mice.pheno[which(mice.pheno$Sex == "M" & mice.pheno$Diet == "chow"),"Bodyweight"]

Below is reproduced the code from that homework for visualizing 100 95% confidence intervals, using just the chow data:

# ============================================================================ #
n.samples = 100   # number of random samples
N = 12   # sample size
Q = qnorm(0.975)  # z-score for +/- 47.5% of a normal distribution

# set up a plot showing a vertical line with the true mean for each population
# rename chow and hf to reflect that we are using these as the population
chowPop = chow
hfPop = hf
## Error in eval(expr, envir, enclos): object 'hf' not found
plot(mean(chowPop)+c(-7,7),c(1,1),type="n",
     xlab="weight",ylab="interval",ylim=c(1,n.samples))
abline(v=mean(chowPop))

# display the 95% CI for a bunch of random samples
#ci_range = c() # initialize empty vector
for (i in 1:n.samples) {
  chow.sample = sample(chowPop,N)          # take a random sample
  se = sd(chow.sample)/sqrt(N)             # compute SEM
  interval = c(mean(chow.sample)-Q*se, mean(chow.sample)+Q*se) # compute 95% CI
#  ci_range[i] = interval[2] - interval[1]  # record size of ci interval
  
  # draw the 95% CI -- color it green if it contains the true population mean, or red if not
  covered = mean(chowPop) <= interval[2] & mean(chowPop) >= interval[1]
  color = ifelse(covered,"forestgreen","red")
  lines(interval, c(i,i), col=color)
}

#summary(ci_range)  # display summary of the distribution of CI ranges
# ============================================================================ #

Re-run the above code block a bunch of times. Do you notice that usually more than 5 CI’s don’t contain the true mean?

a. Create a function to count the number of 95%CI’s that do not contain the true population mean

Now turn the for loop above into a function called ci.fail.counter. Instead of having the code add a line to a graph, modify it so that all it does is return the number of CI’s out of 100 that DO NOT contain the true population mean.

The function should take the following parameters:

  • chowPop: Data for an entire population (vector)
  • N: the sample size (default = 12)
  • n.samples: the number of samples to take (default = 100)

Note:

  • You should create a counter variable to hold the total number of CI’s that don’t contain the population mean; each time you find one, add 1 to this counter.
    • To do this, define an integer variable outside the loop and set it to 0.
  • You should define and set Q inside the function so it doesn’t depend on any other outside information.
ci.fail.counter = function(chowPop, N=12, n.samples=100) {

  counter = 0  # create a counter
  Q = qnorm(0.975)  # z-score for +/- 47.5% of a normal distribution

  for (i in 1:n.samples) {
    chow.sample = sample(chowPop,N)          # take a random sample
    se = sd(chow.sample)/sqrt(N)             # compute SEM
    interval = c(mean(chow.sample)-Q*se, mean(chow.sample)+Q*se) # compute 95% CI
    
    # flag the 95% CI as either covering or not covering the true population mean
    covered = mean(chowPop) <= interval[2] & mean(chowPop) >= interval[1]
    
    # add to counter if CI doesn't cover the population mean
    if (!covered) {counter = counter+1}
  }

  # return the total number of CI's that don't contain the population mean
  return(counter)
}

b. Test the function and tally up the failed CI’s

Below, test your function by calling it once.

When everything looks ok, run the function 100 times and record the number of CI’s that don’t cover the true population mean each time.

  • Hint: you may do this in a couple of different ways; do whatever makes the most sense to you.

Check the results by making a table of the counts.

# call the function once
ci.fail.counter(chowPop = chowPop)
## [1] 7
# initialize an empty vector
ci.fail.count = c()

# run the function 100 times and fill up the vector with the results from each run

## option 1: use replicate function
ci.fail.count = replicate(100, ci.fail.counter(chowPop), simplify = TRUE)

## option 2: use a for loop
for (i in 1:100) {
  ci.fail.count[i] = ci.fail.counter(chowPop)

}

# take a peek at the first few rows of the vector
head(ci.fail.count)
## [1]  8 11  4  7  5  7
# make a table to check the results
table(ci.fail.count)
## ci.fail.count
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 
##  2  1  5  9 13 19 12 10 12  7  6  2  1  1

c. Make a histogram of the CI failure rate

Make a quick plot of the number of 95%CIs per 100 CI’s that don’t contain the true population mean. How does the distribution look? What are the mean and median failed CI’s using the \(z\)-distribution?

hist(ci.fail.count, col="peachpuff", 
     xlab="Number of failed 95%CIs", 
     main="95%CI failure rate per 100 using z-score")

mean(ci.fail.count)
## [1] 6.93
median(ci.fail.count)
## [1] 7

d. Modify your function using a \(t\)-distribution

Copy your function above and modify it so that instead of using the \(z\)-distribution, it uses the 95% quantiles for a \(t\)-distribution with the right number of d.f.

ci.fail.counter = function(chowPop, N=12, n.samples=100) {

  counter = 0  # create a counter
  Q = qt(0.975, df = N-1)  # z-score for +/- 47.5% of a normal distribution

  for (i in 1:n.samples) {
    chow.sample = sample(chowPop,N)          # take a random sample
    se = sd(chow.sample)/sqrt(N)             # compute SEM
    interval = c(mean(chow.sample)-Q*se, mean(chow.sample)+Q*se) # compute 95% CI
    
    # flag the 95% CI as either covering or not covering the true population mean
    covered = mean(chowPop) <= interval[2] & mean(chowPop) >= interval[1]
    
    # add to counter if CI doesn't cover the population mean
    if (!covered) {counter = counter+1}
  }

  # return the total number of CI's that don't contain the population mean
  return(counter)
}

e. Compute the rate of failed CI’s

Repeat part (b) using the new version of your function.

ci.fail.count = c()

for (i in 1:100) {
  ci.fail.count[i] = ci.fail.counter(chowPop)
 
}
head(ci.fail.count)
## [1] 7 8 7 4 3 4
table(ci.fail.count)
## ci.fail.count
##  1  2  3  4  5  6  7  8  9 11 
##  2  8 14 21 20 16  8  5  5  1

f. Make a histogram of the results.

Repeat part (c) using the new data based on the \(t\)-distribution.

hist(ci.fail.count, col="skyblue", 
     xlab="Number of failed 95%CIs", 
     main="95%CI failure rate per 100 using t-score")

mean(ci.fail.count)
## [1] 4.92
median(ci.fail.count)
## [1] 5

How do these results look in comparison to your earlier results?

# your answer here
Using the z-score typically gives a mean/median failure rate of around 7%.

Using the t-score gives a failure rate much closer to 5% - in fact, re-running this a bunch of times oten finds the mean and median to be closer to 4! This suggest that the t-distribution may actually slightly overestimate the width of the CI much of the time.